Memory efficiency is crucial when training deep learning networks on resource-restricted devices. During backpropagation, forward tensors are used to calculate gradients. Instead of keeping all of these dependencies in memory until they are reused in backpropagation, some forward tensors can be discarded and later recomputed from saved tensors, so-called checkpoints. In particular, this allows resource-constrained heterogeneous environments to make use of all available compute devices. Unfortunately, defining these checkpoints is a non-trivial problem and poses a challenge to the programmer: improper or excessive recomputation negates the benefit of checkpointing. In this article, we present XEngine, an approach that schedules network operators to heterogeneous devices in low-memory environments by determining checkpoints and recomputations of tensors. Our approach selects suitable resources per timestep and operator and optimizes the end-to-end time for neural networks while taking the memory limitation of each device into account. To this end, we formulate a mixed-integer quadratic program (MIQP) that schedules operators of deep learning networks on heterogeneous systems. We compare our MIQP solver XEngine against Checkmate, a mixed-integer linear programming (MILP) approach that solves recomputation on a single device. Our solver finds solutions that are up to 22.5% faster than the fastest Checkmate schedule, in which the network is computed exclusively on a single device. We also find valid schedules for networks using both central processing units and graphics processing units when memory limitations do not allow scheduling exclusively on the graphics processing unit.
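To make the scheduling idea concrete, below is a deliberately simplified device-assignment sketch in PuLP. It only covers assigning operators to a CPU or GPU under per-device memory budgets with a linearized transfer penalty; all cost numbers are invented, and the checkpoint/recompute decisions and quadratic terms of the actual XEngine MIQP are omitted.

```python
# Simplified sketch of operator-to-device assignment (NOT the XEngine MIQP itself).
import pulp

ops = range(4)                       # hypothetical operators of a tiny network
devs = ["cpu", "gpu"]
compute = {"cpu": [8, 10, 6, 9], "gpu": [2, 3, 1, 2]}   # invented per-op runtimes
mem = [1, 4, 2, 3]                   # invented output-tensor sizes
limit = {"cpu": 16, "gpu": 6}        # invented per-device memory budgets
transfer = 5                         # invented cost of moving a tensor between devices

prob = pulp.LpProblem("device_assignment", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (list(ops), devs), cat="Binary")  # x[i][d]: op i runs on device d
y = pulp.LpVariable.dicts("y", list(ops), lowBound=0)            # y[i]: transfer between op i and i+1

# objective: total compute time plus transfer penalties
prob += (pulp.lpSum(compute[d][i] * x[i][d] for i in ops for d in devs)
         + transfer * pulp.lpSum(y[i] for i in ops[:-1]))

# each operator runs on exactly one device
for i in ops:
    prob += pulp.lpSum(x[i][d] for d in devs) == 1

# linearization of "consecutive ops on different devices" (quadratic in the full MIQP)
for i in ops[:-1]:
    for d in devs:
        prob += y[i] >= x[i][d] - x[i + 1][d]

# naive memory constraint: every output stays on the device that produced it
for d in devs:
    prob += pulp.lpSum(mem[i] * x[i][d] for i in ops) <= limit[d]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({i: next(d for d in devs if x[i][d].value() > 0.5) for i in ops})
```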
This work presents a detailed linguistic analysis of why larger Transformer-based pre-trained language models with more parameters and lower perplexity nonetheless yield surprisal estimates that are less predictive of human reading times. First, regression analyses show a strictly monotonic, positive log-linear relationship between perplexity and fit to reading times for five more recently released GPT-Neo variants and eight OPT variants on two separate datasets, replicating earlier results that were limited to GPT-2 (Oh et al., 2022). Subsequently, analysis of residual errors reveals a systematic deviation of the larger variants, such as underpredicting reading times of named entities and making compensatory overpredictions for reading times of function words such as modals and conjunctions. These results suggest that the propensity of larger Transformer-based models to 'memorize' sequences during training makes their surprisal estimates diverge from humanlike expectations, which warrants caution in using pre-trained language models to study human language processing.
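As a rough illustration of this kind of analysis, the sketch below computes per-token surprisal from a GPT-Neo checkpoint and fits a simple least-squares line against synthetic reading times. The model name, example sentence, synthetic reading times, and the use of ordinary least squares (rather than the linear mixed-effects regressions typically used in such studies) are all placeholder assumptions.

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/gpt-neo-125M"           # smallest GPT-Neo variant, for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def token_surprisals(text):
    """Per-token surprisal in bits under the language model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    idx = torch.arange(ids.shape[1] - 1)
    return (-logprobs[idx, ids[0, 1:]] / np.log(2)).numpy()

s = token_surprisals("The old man the boats near the dock .")
rng = np.random.default_rng(0)
rt = 300 + 25 * s + rng.normal(0, 15, len(s))   # synthetic reading times (ms), illustration only

slope, intercept = np.polyfit(s, rt, deg=1)     # toy stand-in for mixed-effects regression
print(f"{slope:.1f} ms per bit of surprisal")
```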
Transformer-based large language models are trained to make predictions about the next word by aggregating representations of previous tokens through their self-attention mechanism. In the field of cognitive modeling, such attention patterns have recently been interpreted as embodying the process of cue-based retrieval, in which attention over multiple targets is taken to generate interference and latency during retrieval. Under this framework, this work first defines an entropy-based predictor that quantifies the diffuseness of self-attention, as well as distance-based predictors that capture the incremental change in attention patterns across timesteps. Moreover, following recent studies that question the informativeness of attention weights, we also experiment with alternative methods of incorporating vector norms into attention weights. Regression experiments using predictors calculated from the GPT-2 language model show that these predictors deliver a substantially better fit to held-out self-paced reading and eye-tracking data than a rigorous baseline that includes GPT-2 surprisal. Additionally, the distance-based predictors generally demonstrate higher predictive power, with effect sizes of up to 6.59 ms per standard deviation on self-paced reading times (compared to 2.82 ms for surprisal) and 1.05 ms per standard deviation on eye-gaze durations (compared to 3.81 ms for surprisal).
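A rough sketch of the entropy-based diffuseness predictor is given below: it extracts GPT-2's attention weights and computes the Shannon entropy of each token's attention distribution. Averaging across layers and heads is an assumption made here for brevity, not necessarily the paper's exact aggregation.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def attention_entropy(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_attentions=True)
    # out.attentions: one (batch, heads, seq, seq) tensor per layer
    att = torch.stack(out.attentions).mean(dim=(0, 2))[0]   # average over layers and heads -> (seq, seq)
    ent = -(att * torch.log(att + 1e-12)).sum(dim=-1)       # Shannon entropy of each token's attention
    return ent  # one value per timestep: how diffusely that token attends to its context

print(attention_entropy("The cat sat on the mat"))
```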
Hawkes processes have recently risen to the forefront of tools for modeling and generating sequential event data. Multidimensional Hawkes processes model both the self- and cross-excitation between different types of events and have been applied successfully in various domains such as finance, epidemiology, and personalized recommendations, among others. In this work we present an adaptation of the Frank-Wolfe algorithm for learning multidimensional Hawkes processes. Experimental results show that our approach achieves parameter-estimation accuracy that is better than or on par with other first-order methods, while enjoying a significantly faster runtime.
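For readers unfamiliar with the method, the sketch below shows the generic Frank-Wolfe template (a linear minimization oracle plus convex averaging) on a toy box-constrained quadratic. The paper's contribution is adapting this projection-free scheme to Hawkes-process likelihoods, which is not reproduced here; the objective and feasible set below are placeholders.

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, iters=100):
    x = x0.copy()
    for t in range(iters):
        s = lmo(grad(x))              # linear minimization oracle over the feasible set
        gamma = 2.0 / (t + 2.0)       # standard step-size schedule
        x = (1 - gamma) * x + gamma * s
    return x

# toy problem: minimize ||x - b||^2 over the box [0, 1]^d
b = np.array([0.3, 1.7, -0.4])
grad = lambda x: 2 * (x - b)
lmo = lambda g: np.where(g < 0, 1.0, 0.0)   # vertex of the box minimizing <g, s>
print(frank_wolfe(grad, lmo, x0=np.zeros(3)))   # approaches [0.3, 1.0, 0.0]
```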
Incorporating computed tomography (CT) reconstruction operators into differentiable pipelines has proven beneficial in many applications. Such approaches usually focus on the projection data and keep the acquisition geometry fixed. However, precise knowledge of the acquisition geometry is essential for high-quality reconstruction results. In this paper, the differentiable formulation of fan-beam CT reconstruction is extended to the acquisition geometry. This allows gradient information to be propagated from a loss function on the reconstructed image into the geometry parameters. As a proof-of-concept experiment, this idea is applied to rigid motion compensation. The cost function is parameterized by a trained neural network which regresses an image quality metric from the motion-affected reconstruction alone. Using the proposed method, we are the first to optimize such an autofocus-inspired algorithm based on analytical gradients. The algorithm achieves a reduction in MSE by 35.5% and an improvement in SSIM by 12.6% over the motion-affected reconstruction. Beyond motion compensation, we see further use cases of our differentiable method for scanner calibration or hybrid techniques employing deep models.
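The core idea, making a geometry parameter a differentiable leaf so that gradients from an image-domain loss flow into it, can be sketched with a toy example: a single in-plane rotation angle stands in for the fan-beam acquisition geometry, and an MSE against a reference image replaces the trained image-quality network. Everything below is illustrative rather than the paper's actual pipeline.

```python
import torch
import torch.nn.functional as F

# smooth toy "motion-free" image (a Gaussian blob)
xs = torch.linspace(-1, 1, 64)
yy, xx = torch.meshgrid(xs, xs, indexing="ij")
reference = torch.exp(-((xx - 0.3) ** 2 + yy ** 2) / 0.05).reshape(1, 1, 64, 64)

def rotate(img, theta):
    c, s = torch.cos(theta), torch.sin(theta)
    mat = torch.stack([torch.stack([c, -s, torch.zeros(())]),
                       torch.stack([s,  c, torch.zeros(())])]).unsqueeze(0)
    grid = F.affine_grid(mat, img.shape, align_corners=False)
    return F.grid_sample(img, grid, align_corners=False)

theta_true = torch.tensor(0.2)
corrupted = rotate(reference, theta_true)        # toy "motion-affected" image

theta = torch.zeros((), requires_grad=True)      # geometry parameter to recover
opt = torch.optim.Adam([theta], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = F.mse_loss(rotate(corrupted, -theta), reference)   # stand-in for the quality network
    loss.backward()                              # gradient flows into the geometry parameter
    opt.step()
print(float(theta))                              # should move toward theta_true = 0.2
```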
The Concept Bottleneck Models (CBMs) of Koh et al. [2020] provide a means of ensuring that a neural-network-based classifier bases its predictions solely on human-understandable concepts. The concept labels, or rationales as we refer to them, are learned by the concept labeling component of the CBM. Another component learns to predict the target classification label from these predicted concept labels. Unfortunately, these models are heavily reliant on human-provided concept labels for each datapoint. To enable CBMs to behave robustly when these labels are not readily available, we show how to equip them with the ability to abstain from predicting concepts when the concept labeling component is uncertain. In other words, our model learns to provide rationales for its predictions, but only when it is confident that the rationale is correct.
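A minimal sketch of concept-level abstention is shown below: the concept labeler commits to a rationale only when its confidence clears a threshold and abstains otherwise. The threshold value and the use of raw predicted probabilities as the confidence measure are assumptions for illustration.

```python
import torch

def predict_or_abstain(concept_probs, threshold=0.9):
    """concept_probs: (batch, n_concepts) in [0, 1].  Returns 1/0 per concept,
    or -1 where the model abstains because it is not sufficiently certain."""
    confident = torch.maximum(concept_probs, 1 - concept_probs) >= threshold
    hard = (concept_probs >= 0.5).long()
    return torch.where(confident, hard, torch.full_like(hard, -1))

probs = torch.tensor([[0.97, 0.55, 0.02], [0.40, 0.99, 0.88]])
print(predict_or_abstain(probs))
# tensor([[ 1, -1,  0],
#         [-1,  1, -1]])
```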
Concept bottleneck models perform classification by first predicting which of a list of human-provided concepts are true of a datapoint. A downstream model then uses these predicted concept labels to predict the target label, so the predicted concepts act as a rationale for the target prediction. Model trust issues emerge in this paradigm when soft concept labels are used: it has previously been observed that extra information about the data distribution leaks into the concept predictions. In this work we show how Monte-Carlo Dropout can be used to obtain soft concept predictions that do not contain leaked information.
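The sketch below illustrates the Monte-Carlo Dropout recipe: dropout is kept active at inference time and several stochastic forward passes are averaged into soft concept predictions, with their spread as an uncertainty estimate. The network architecture and sample count are illustrative choices, not the paper's.

```python
import torch
import torch.nn as nn

class ConceptPredictor(nn.Module):
    def __init__(self, in_dim=128, n_concepts=10, p=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Dropout(p),
            nn.Linear(64, n_concepts))

    def forward(self, x):
        return torch.sigmoid(self.net(x))

def mc_dropout_concepts(model, x, n_samples=50):
    model.train()                          # keep dropout layers stochastic at test time
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(0), samples.std(0)   # soft concept probabilities + uncertainty

model = ConceptPredictor()
probs, uncertainty = mc_dropout_concepts(model, torch.randn(4, 128))
print(probs.shape, uncertainty.shape)      # (4, 10) each
```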
Scientists often use observational time-series data to study complex natural processes, from climate change to civil conflict to brain activity. But regression analyses of these data typically assume simplistic dynamics. Recent advances in deep learning have yielded startling improvements in models of complex processes, from speech comprehension to nuclear physics to competitive gaming, yet deep learning is generally not used for scientific analysis. Here we bridge this gap by showing that deep learning can be used not merely to imitate but to analyze complex processes, providing flexible function approximation while preserving interpretability. Our approach, the continuous-time deconvolutional regressive neural network (CDRNN), relaxes standard simplifying assumptions (e.g., linearity, stationarity, and homoscedasticity) that are implausible for many natural systems and can critically affect the interpretation of data. We evaluate CDRNNs on human language processing, a domain with complex continuous dynamics. We demonstrate substantial improvements in predictive likelihood on behavioral and neuroimaging data, and we show that CDRNNs can flexibly discover novel patterns in exploratory analyses, provide robust control of possible confounds in confirmatory analyses, and open up research questions that would otherwise be difficult to study with observational data.
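A rough sketch of the deconvolutional idea underlying CDRNNs: the response at time t is modeled as a sum, over preceding events, of a learned impulse-response function of each event's predictors and its temporal offset. The tiny network and toy data below are placeholders, not the paper's architecture or training setup.

```python
import torch
import torch.nn as nn

class ImpulseResponseNet(nn.Module):
    """Maps (predictor vector, time offset since event) -> contribution to the response."""
    def __init__(self, n_predictors=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_predictors + 1, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, predictors, offsets):
        return self.net(torch.cat([predictors, offsets.unsqueeze(-1)], dim=-1)).squeeze(-1)

def predict_response(irf, event_times, event_predictors, t):
    """Convolve the learned impulse response over all events preceding time t."""
    mask = event_times <= t
    offsets = t - event_times[mask]
    return irf(event_predictors[mask], offsets).sum()

irf = ImpulseResponseNet()
event_times = torch.tensor([0.0, 0.4, 1.1, 2.0])
event_predictors = torch.randn(4, 2)          # e.g., surprisal and word length per word
print(float(predict_response(irf, event_times, event_predictors, t=2.5)))
```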
Ventricular tachycardia (VT) may be one cause of the 4.25 million cardiac deaths worldwide. One treatment option is catheter ablation to deactivate the abnormally triggering regions. To facilitate and accelerate localization during the ablation procedure, we present two novel localization techniques based on convolutional neural networks (CNNs). In contrast to existing methods, e.g., those using ECG imaging, our approaches are designed to be independent of patient-specific geometries and directly applicable to surface ECG signals, while also providing a binary transmural position. One method outputs ranked alternative solutions. Results can be visualized on either a generic or a patient-specific geometry. The CNNs were trained on a dataset containing only simulated data and evaluated on both simulated and clinical test data. On simulated data, the median test error was below 3 mm. The median localization error on clinical data was as low as 32 mm. Up to 82% of the transmural positions were detected correctly across all clinical cases. Using the ranked alternative solutions, the top-3 median error dropped to 20 mm on clinical data. These results demonstrate a proof of principle for using CNNs to localize the activation source without the need for patient-specific geometry information. Furthermore, providing multiple possible solutions can help the physician find the actual activation source among several candidate locations. With further optimization, these methods have high potential to speed up clinical interventions; consequently, they could reduce procedural risk and improve outcomes for VT patients.
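As an illustration of the kind of model described above, the sketch below defines a small 1D CNN that regresses a 3D activation-origin coordinate from multi-lead surface ECG signals. The lead count, signal length, and architecture are assumptions; the actual work is trained on simulated ECGs and additionally predicts a binary transmural position and ranked alternative solutions.

```python
import torch
import torch.nn as nn

class ECGLocalizer(nn.Module):
    def __init__(self, n_leads=12, n_out=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_leads, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(4),
            nn.AdaptiveAvgPool1d(1))
        self.head = nn.Linear(64, n_out)      # x, y, z of the predicted activation origin

    def forward(self, ecg):                   # ecg: (batch, leads, timesteps)
        return self.head(self.features(ecg).squeeze(-1))

model = ECGLocalizer()
fake_ecg = torch.randn(8, 12, 500)            # batch of simulated 12-lead ECG beats
print(model(fake_ecg).shape)                  # torch.Size([8, 3])
```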
Over the past few years, online learning of Hawkes processes has received increasing attention, particularly for modeling a network of actors. However, these works typically model either the rich interactions between events, the latent clustering of actors, or the network structure among actors. We propose to jointly model the latent structure of the network of actors as well as their rich interactions, in the setting of real-world medical and financial applications. Experimental results on synthetic and real-world data showcase the efficacy of our approach.